Recovering from a Disaster in an Exchange Server 2010 Environment : Preparing for a More Easily Recoverable Environment

1/8/2011 4:08:14 PM

Before a full server recovery is performed, or before a series of service packs, patches, updates, or other fixes is applied in an attempt to recover the server, a full backup of the server should be performed.

At first, it might seem unnecessary to back up a server that isn’t working properly, but during the problem-solving and debugging process, it is quite possible for a server to end up in even worse shape after a few updates and fixes have been applied. The initial problem might have been that a single mailbox couldn’t be accessed, and after some problem-solving efforts, the entire server might be inaccessible. A backup provides a rollback to the point of the initial problem state. When making changes in an attempt to fix a server, you always want a way to roll back a change if it turns out to make the situation worse. When the backup is complete, verify that the backup is valid, ensuring that no open files are skipped during the backup process or that, if the files are skipped, they are backed up in other open file backup processes. This way, you will always have the ability to return to your starting point in case you need to try a different method to fix the server.

Caution

When performing any recovery of an Exchange server or resource, be careful what you delete, modify, or change. As a rule of thumb, never delete objects that are known throughout the directory; because each object is uniquely identified, a deleted object cannot simply be re-created and restored. As an example, if you plan to restore an entire server from tape, you do not want to first delete the server and then add the server back during the restoration process. The restoration process requires the existence of the old server in the directory. Deleting the server object and then adding the object again later gives the object a completely different globally unique identifier (GUID). Even though you restore the entire Exchange server from tape, the ID of the server and all the objects in the server will be different, making it more difficult to recover the server. Other replicable objects that should not be deleted include public folders, public folder trees, groups, and distribution lists.

Validating Backup Data and Procedures

Another important task that should be done before doing any maintenance, service, or repairs on an Exchange server is to validate that a full backup of the server exists, test the condition of the backup, and then secure the backup. Far too many organizations proceed with risky recovery procedures, believing that they have a fallback position by restoring from tape, only to realize that the tape backup is corrupt or that a complete backup does not exist. Equally important is to be sure that the tape you might need is actually onsite. Many companies send tapes offsite for storage. If you depend on a particular backup tape for your rollback, be sure it is readily accessible.
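The checks described above can be partly automated. The following is a minimal sketch, not a vendor procedure: it assumes the backup has been written to a file on disk and that a SHA-256 checksum was recorded when the backup completed. The function names, the file path, and the 24-hour age limit are hypothetical and chosen only for illustration.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found; an empty list means every check passed.

    The path, checksum, and age limit are assumptions for this sketch.
    """
    path = Path(path)
    if not path.is_file():
        return ["backup file is missing: %s" % path]
    problems = []
    if path.stat().st_size == 0:
        problems.append("backup file is empty")
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600.0
    if age_hours > max_age_hours:
        problems.append("backup is %.1f hours old" % age_hours)
    if sha256_of(path) != expected_sha256:
        problems.append("checksum does not match the recorded value")
    return problems
```

A matching checksum shows only that the file has not changed since the checksum was recorded. It does not prove that the backup can be restored, so a check like this complements a test restore rather than replacing it.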

If the administrators of the network realize that there is no clean backup, the procedures taken to recover the system might be different from those used if a backup had existed. If a full backup exists and is verified to be in good condition, the organization has an opportunity to restore from tape if a full restore is necessary. This requirement is somewhat lessened in an environment where Database Availability Groups (DAGs) are utilized because those configurations can tolerate the failure of a system should something go wrong during an upgrade or maintenance.

Steps can be taken to help an organization more easily prepare for a recoverable environment. This involves documenting server states and conditions, performing specific backup procedures, and setting up new features in Exchange Server 2010 that provide for a simpler restoration process. By maintaining these processes and performing regular test restores, a company can feel confident that it can quickly and easily recover from a disaster. Most notable is the use of Database Availability Groups (DAGs) to provide redundant mailbox services. Because the failover to another replica within a DAG is essentially transparent to the end users, it is considered a best practice with Exchange Server 2010 to utilize DAGs.

Documenting the Exchange Server Environment

Key to the success of recovering an Exchange server or an entire Exchange Server environment is having documentation on the server configurations. Having specific server configuration information documented helps to identify which server is not operational, the routing of information between servers, and, ultimately, the impact that a server failure or server recovery will have on the rest of the Exchange Server environment. By having a complete understanding of the Exchange Server environment as a whole, an administrator can often bring up temporary services to alleviate a failure and give themselves more time to fix the issue and determine the root cause.

Note

A utility called ExchDump can assist an administrator with baselining and improving the environment. Use ExchDump to export and document a server’s configuration. The ExchDump utility can be downloaded from the Microsoft Exchange Server download page at www.microsoft.com/exchange/downloads/2003/default.mspx.

Although this utility was originally written for Exchange Server 2003, it works fine for extracting the same information from an Exchange Server 2010 server.

Some of the items that should be documented include the following:

Server name
Server roles held
Version of Windows on servers (including service pack)
Version of Exchange Server on servers (including service pack)
Organization name in Exchange Server
Site names
Database names
Location of databases
Size of databases
When database maintenance was last run
Public folder tree name
Replication process of public folders
Security delegation and administrative rights
Names and locations of global catalog servers
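One way to keep this documentation consistent across servers is to record it in a structured form. The sketch below is only an illustration: the field names mirror some of the items in the list above, and every value in the example (server name, versions, path, dates) is hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class DatabaseRecord:
    name: str
    location: str
    size_gb: float
    last_maintenance: str  # date as recorded by the administrator


@dataclass
class ServerRecord:
    server_name: str
    roles: List[str]
    windows_version: str
    exchange_version: str
    organization_name: str
    site_name: str
    databases: List[DatabaseRecord] = field(default_factory=list)
    global_catalog_servers: List[str] = field(default_factory=list)


def to_json(record):
    """Serialize a server record so it can be stored with the recovery documents."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical example values, for illustration only.
example = ServerRecord(
    server_name="EX01",
    roles=["Mailbox", "Hub Transport"],
    windows_version="Windows Server 2008 R2 SP1",
    exchange_version="Exchange Server 2010 SP1",
    organization_name="Example Org",
    site_name="HQ",
    databases=[DatabaseRecord("DB01", "D:\\ExchangeDB\\DB01.edb", 120.5, "2011-01-02")],
    global_catalog_servers=["DC01"],
)
print(to_json(example))
```

Storing the records as plain text makes it easy to compare them between reviews and to keep a copy somewhere other than the Exchange servers themselves.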

Documenting the Backup Process

To simplify a restore of an Exchange Server environment, it is important to start with a clean backup. A clean backup is performed when the proper backup process is followed. Create a backup process that works, document the step-by-step procedures to back up the server, follow the procedures regularly, and then validate that the backups have been completed successfully.

Also, when configurations change, the backup process and system configurations should be documented and validated again, to make sure that the backups are completed properly.

Documenting the Recovery Process

An important aspect of recovery feasibility is knowing how to recover from a disaster. Just knowing what to back up and what scenarios to plan for is not enough. Restore processes should be created and tested to ensure that a restore can meet service level agreements (SLAs) and that the staff members understand all the necessary steps.

When a process is determined, it should be documented, and the documentation should be written to make sense to the desired audience. For example, if a failure occurs in a satellite office that has only marketing employees and one of them is forced to recover a server, the documentation needs to be written so that it can be understood by just about anyone. If the information technology (IT) staff will be performing the restore, the documentation can be less detailed, but it assumes a certain level of knowledge and expertise with the server product. The first paragraph of any document related to backup and recovery should be a summary of what the document is used for and the level of skill necessary to perform the task and understand the document.

The recovery process for an Exchange Server problem should focus not only on getting the entire Exchange server back up and operational, but also on smaller steps that might help minimize downtime. As an example, if an Exchange server has failed, instead of trying to restore 10TB of mail back to the server, which can take hours, if not days, to complete, an organization can choose to restore just the user Inboxes, calendars, and contacts first. After this faster recovery of core information on a server, the balance of the information can be restored over the next several hours.
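A rough calculation shows why a staged restore can matter. The sketch below assumes a sustained restore throughput of 100 MB per second and assumes that the core data (Inboxes, calendars, and contacts) amounts to 512GB of the 10TB total. Both figures are hypothetical and should be replaced with values measured during test restores.

```python
def restore_hours(size_gb, throughput_mb_per_s):
    """Estimate the hours needed to restore size_gb at a sustained throughput.

    Uses binary units: 1 GB = 1024 MB.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (size_gb * 1024.0) / throughput_mb_per_s
    return seconds / 3600.0


# Assumed figures, for illustration only.
full_size_gb = 10 * 1024   # the 10TB example from the text
core_size_gb = 512         # hypothetical size of Inboxes, calendars, contacts
throughput = 100           # hypothetical sustained MB per second

print("full restore: %.1f hours" % restore_hours(full_size_gb, throughput))  # about 29.1 hours
print("core restore: %.1f hours" % restore_hours(core_size_gb, throughput))  # about 1.5 hours
```

Under these assumptions, users regain access to core data in under two hours instead of waiting more than a day for the full restore. Comparing such estimates against the SLA is one way to decide whether a staged restore is worth documenting.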

The other advantage of having a properly documented restore procedure is that it greatly reduces the chances of human error occurring during a restore. Recovering a failed server while hundreds or possibly thousands of email users are affected is a stressful situation. This isn’t the time to learn how to perform a restore. The goal in this situation is for the administrator to follow a clearly documented and well-tested process to ensure that no steps are missed and that no information is entered incorrectly. Having well-documented steps can greatly reduce the stress of this situation and increase the chances of a successful restore.

Even if an environment uses DAGs as its primary form of disaster recovery, there should still be a documented procedure for what to do when a replica fails. Although the rebuild of redundant systems can be delayed, the longer the delay, the more data will have to be incrementally reseeded and the longer the company remains at higher risk should other replicas fail.

Including Test Restores in the Scheduled Maintenance

Part of a successful disaster recovery plan involves periodically testing the restore procedures to verify accuracy and testing the backup media to ensure that data can actually be recovered. Most organizations or administrators assume that if the backup software reports “Successful,” the backup is good and data can be recovered. If special backup requirements are not addressed, a backup reported as successful might still not contain everything necessary to restore a server if data loss or software corruption occurs.

Restores of file data, application data, and configurations should be performed as part of a regular maintenance schedule to ensure that the backup method is correct and that disaster recovery procedures and documentation are current. Such tests also should verify that the backup media can be read from and used to restore data. Adding periodic test restores to regular maintenance intervals ensures that backups are successful and familiarizes the administrators with the procedures necessary to recover so that when a real disaster occurs, the recovery can be performed correctly and efficiently the first time.

These test restores should occur in a lab environment in which end users won’t be affected. The restores should vary in type, testing single mailbox restores, complete server restores, and full site restores in which even domain controllers might need to be restored from scratch. This helps ensure that staff members are comfortable with the process and will have no problem performing a restore in production should the occasion ever arise.
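To make sure the restore types actually vary, the rotation can be planned ahead of time. The following sketch simply cycles through the three restore types named above; the period labels and the function name are hypothetical.

```python
from itertools import cycle

RESTORE_TYPES = ["single mailbox", "complete server", "full site"]


def plan_test_restores(periods, restore_types=RESTORE_TYPES):
    """Pair each maintenance period with the next restore type in rotation."""
    return list(zip(periods, cycle(restore_types)))


# Hypothetical quarterly maintenance windows.
for period, restore_type in plan_test_restores(["Q1", "Q2", "Q3", "Q4"]):
    print(period, "->", restore_type)
```

With four periods and three restore types, the rotation wraps around, so each type is rehearsed at least once a year and a different type is repeated each year.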